Data Source : https://data.bts.gov/Research-and-Statistics/Freight-Transportation-Service-Index/n68x-u7m7
The aim of this project is to develop a forecasting model for the Transportation Services Index (TSI) based on historical data. The TSI, produced by the Bureau of Transportation Statistics (BTS), serves as a measure of the volume of freight and passenger transportation services moved monthly by the for-hire transportation sector in the United States. This index is composed of three components: a freight index, a passenger index, and a combined index, each incorporating data from various for-hire transportation modes.
The primary focus of this project is to forecast the TSI values exclusively, while disregarding the other components of the index. Changes in the TSI reflect fluctuations in the demand for transportation services (excluding external shocks), which are indicative of broader economic trends. For instance, during periods of economic expansion, demand for goods and services typically increases, leading to a corresponding rise in the TSI.
To achieve this objective, we will utilize time series analysis techniques to model the historical TSI data and generate forecasts for future TSI values. Time series analysis allows us to identify patterns, trends, and seasonal variations in the data, which are crucial for developing accurate forecasting models. We will explore various time series forecasting methods, such as autoregressive integrated moving average (ARIMA) models and exponential smoothing methods.
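As a preview of this workflow, the sketch below shows how an ARIMA model and an exponential smoothing (ETS) model might be fitted and compared with the `forecast` package. The series here is synthetic placeholder data, not the actual TSI:

```r
library(forecast)

# Placeholder monthly series standing in for the TSI data
tsi_ts <- ts(rnorm(120, mean = 100, sd = 5), start = c(2000, 1), frequency = 12)

fit_arima <- auto.arima(tsi_ts)  # selects (p, d, q) automatically
fit_ets   <- ets(tsi_ts)         # exponential smoothing state-space model

# 12-month-ahead forecasts from each model
fc_arima <- forecast(fit_arima, h = 12)
fc_ets   <- forecast(fit_ets, h = 12)

# Compare in-sample accuracy measures (ME, RMSE, MAE, MAPE, ...)
accuracy(fit_arima)
accuracy(fit_ets)
```
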
The forecasted TSI values will provide valuable insights for stakeholders in the transportation industry, policymakers, and economic analysts. By anticipating changes in the demand for transportation services, informed decisions can be made regarding infrastructure investments, capacity planning, and economic policy formulation. Additionally, understanding future trends in transportation demand can help businesses optimize their supply chain management strategies and logistics operations.
Overall, this project aims to contribute to a better understanding of the dynamics of transportation services demand and provide actionable forecasts to support decision-making processes in both the public and private sectors.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.5.0 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(tseries)
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
transportation <- read_csv("Transportation_Services_Index_and_Seasonally.csv")
## Rows: 289 Columns: 66
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): ID, OBS_DATE
## dbl (64): ASM, ASM_D11, ASM_D, ASM_D_D11, ASM_I, ASM_I_D11, RPM, RPM_D11, RP...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
transportation
## # A tibble: 289 × 66
## ID OBS_DATE ASM ASM_D11 ASM_D ASM_D_D11 ASM_I ASM_I_D11 RPM
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 SATD200001 01/01/2000 7.61e7 7.79e7 5.63e7 57569024 1.99e7 20363265 4.83e7
## 2 SATD200002 02/01/2000 7.30e7 7.86e7 5.43e7 57825860 1.87e7 20735310 4.86e7
## 3 SATD200003 03/01/2000 8.03e7 7.94e7 5.99e7 58447159 2.04e7 20919253 5.96e7
## 4 SATD200004 04/01/2000 7.81e7 7.94e7 5.74e7 58407037 2.07e7 20962932 5.76e7
## 5 SATD200005 05/01/2000 8.05e7 7.91e7 5.89e7 58182279 2.16e7 20908676 5.97e7
## 6 SATD200006 06/01/2000 8.04e7 7.89e7 5.83e7 57589554 2.21e7 21294849 6.40e7
## 7 SATD200007 07/01/2000 8.39e7 7.91e7 6.06e7 57759243 2.32e7 21307529 6.64e7
## 8 SATD200008 08/01/2000 8.47e7 7.93e7 6.15e7 57934341 2.32e7 21386027 6.53e7
## 9 SATD200009 09/01/2000 7.99e7 8.15e7 5.79e7 59928453 2.20e7 21587317 5.55e7
## 10 SATD200010 10/01/2000 8.26e7 8.12e7 6.08e7 59426878 2.19e7 21731495 5.80e7
## # ℹ 279 more rows
## # ℹ 57 more variables: RPM_D11 <dbl>, RPM_D <dbl>, RPM_D_D11 <dbl>,
## # RPM_I <dbl>, RPM_I_D11 <dbl>, LOAD_FACTOR <dbl>, LOAD_FACTOR_D11 <dbl>,
## # LOAD_FACTOR_D <dbl>, LOAD_FACTOR_D_D11 <dbl>, LOAD_FACTOR_I <dbl>,
## # LOAD_FACTOR_I_D11 <dbl>, ENPLANE_I <dbl>, ENPLANE_I_D11 <dbl>,
## # ENPLANE_D <dbl>, ENPLANE_D_D11 <dbl>, ENPLANE <dbl>, ENPLANE_D11 <dbl>,
## # AIR_RPM_TSI <dbl>, AIR_RPM_TSI_D11 <dbl>, AIR_RTMFM <dbl>, …
From the above data set, we are interested in analyzing the TSI_Total variable.
Below are the time series, ACF, PACF, and autocovariance plots of the TSI_Total variable.
ts_transportation <- ts(transportation$TSI_Total, start = c(2000, 1), frequency = 12, end = c(2023, 12))
plot(ts_transportation,main= " ",
xlab= "Time (months)", ylab="TSI_Total",type="o")
par(mfrow = c(2, 2))
ts.plot(ts_transportation)
acf(ts_transportation)
pacf(ts_transportation)
acf(ts_transportation,type="covariance")
par(mfrow = c(2, 2))
plot(diff(ts_transportation))
acf(diff( ts_transportation))
pacf(diff(ts_transportation))
acf(diff(ts_transportation),type="covariance")
From the time series plot above, we can see that the data is not stationary: the trend moves upward gradually but falls sharply during the COVID period. The ACF and PACF plots also do not look stationary, as there are significant lags throughout the period. After differencing, the data appears stationary except around the COVID period; the ACF of the differenced series looks good, while the PACF shows a significant spike at lag 1.
A logarithm transformation was applied to make the data closer to normal; however, as the ACF and PACF plots below show, it does not bring the autocorrelations close to zero.
par(mfrow = c(2, 2))
ts.plot(log(ts_transportation))
acf(log(ts_transportation))
pacf(log(ts_transportation))
acf(log(ts_transportation),type="covariance")
Because our analysis focuses on the series without external shocks, the data set was truncated to the pre-pandemic period (the first 204 observations), removing the portion affected by the pandemic.
Following are the plots used to identify and analyze the correlation structure of the filtered series: the ACF plot has many significant lags, whereas in the PACF plot only the first lag is significant.
ts_transportation_filtered <- ts_transportation[1:204]
par(mfrow = c(2, 2))
ts.plot(ts_transportation_filtered)
acf(ts_transportation_filtered)
pacf(ts_transportation_filtered)
acf(ts_transportation_filtered,type="covariance")
Following are the three- and six-month moving averages of the time series, used to examine the smoothing behavior of the data.
library(zoo)
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
# Calculate 3-month moving average
moving_avg <- rollmean(ts_transportation_filtered, k = 3, align = "center", fill = NA)
# Plot original time series and moving average
plot(ts_transportation_filtered, main="Original Time Series vs. 3-Month Moving Average", ylab="Transportation Index")
lines(moving_avg, col="red")
legend("topright", legend=c("Original", "Moving Average"), col=c("black", "red"), lty=1)
# Calculate 6-month moving average
moving_avg <- rollmean(ts_transportation_filtered, k = 6, align = "center", fill = NA)
# Plot original time series and moving average
plot(ts_transportation_filtered, main="Original Time Series vs. 6-Month Moving Average", ylab="Transportation Index")
lines(moving_avg, col="red")
legend("topright", legend=c("Original", "Moving Average"), col=c("black", "red"), lty=1)
Following are the time series, ACF, and PACF plots of the differenced filtered series. The data looks stationary: in the ACF plot the first lag is significant, and in the PACF plot the first lag is significant as well.
par(mfrow = c(2, 2))
ts.plot(diff(ts_transportation_filtered))
acf(diff(ts_transportation_filtered))
pacf(diff(ts_transportation_filtered))
acf(diff(ts_transportation_filtered),type="covariance")
adf.test((ts_transportation_filtered))
##
## Augmented Dickey-Fuller Test
##
## data: (ts_transportation_filtered)
## Dickey-Fuller = -2.1091, Lag order = 5, p-value = 0.5304
## alternative hypothesis: stationary
adf.test(diff(ts_transportation_filtered))
## Warning in adf.test(diff(ts_transportation_filtered)): p-value smaller than
## printed p-value
##
## Augmented Dickey-Fuller Test
##
## data: diff(ts_transportation_filtered)
## Dickey-Fuller = -4.6921, Lag order = 5, p-value = 0.01
## alternative hypothesis: stationary
The Augmented Dickey-Fuller (ADF) test confirms this. For the undifferenced data, the p-value (0.5304) is above 0.05, so we fail to reject the null hypothesis of non-stationarity. For the differenced data, the p-value (0.01) is below 0.05, so we reject the null hypothesis and conclude that the differenced series is stationary.
Split the data into train and test data sets.
train_transportation <- ts_transportation_filtered[1:180] # take filtered data
test_transportation <- ts_transportation_filtered[180 : 203]
time_train_transportation <- time(train_transportation)
time_test_transportation <- time(test_transportation)
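Note that the indexing above assigns observation 180 to both the training and test sets. A non-overlapping variant of the split, sketched here with the same object names, would keep the two sets disjoint:

```r
# Non-overlapping split (sketch): indices 1-180 for training,
# 181-204 for testing, so no observation appears in both sets
train_alt <- ts_transportation_filtered[1:180]
test_alt  <- ts_transportation_filtered[181:204]
length(train_alt)  # 180
length(test_alt)   # 24
```
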
Applied linear models to the data, with the time series as the response variable and the time index as the predictor variable.
mlr.lin = lm( train_transportation ~ time_train_transportation)
summary(mlr.lin)
##
## Call:
## lm(formula = train_transportation ~ time_train_transportation)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.8749 -2.2539 -0.0125 3.4972 6.8250
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 99.257784 0.606496 163.66 <2e-16 ***
## time_train_transportation 0.101921 0.005812 17.54 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.052 on 178 degrees of freedom
## Multiple R-squared: 0.6334, Adjusted R-squared: 0.6313
## F-statistic: 307.5 on 1 and 178 DF, p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(mlr.lin, main="",which = 1:4)
The linear regression model indicates a strong positive relationship between time and the training series: each unit increase in time is associated with a 0.102-unit increase in the index. The model explains approximately 63.34% of the variability, with both the intercept and the time coefficient statistically significant (p < 0.001). However, the residual plots do not look good, as there are large spikes in the residuals away from the center.
time_sqft_train_transportation = time_train_transportation^2/factorial(2)
mlr.quad=lm( train_transportation ~ time_train_transportation+time_sqft_train_transportation )
summary(mlr.quad)
##
## Call:
## lm(formula = train_transportation ~ time_train_transportation +
## time_sqft_train_transportation)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.0657 -2.1983 0.1021 3.4398 6.6659
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 98.7801811 0.9174515 107.668 < 2e-16 ***
## time_train_transportation 0.1176665 0.0234037 5.028 1.21e-06 ***
## time_sqft_train_transportation -0.0001740 0.0002505 -0.695 0.488
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.057 on 177 degrees of freedom
## Multiple R-squared: 0.6344, Adjusted R-squared: 0.6303
## F-statistic: 153.6 on 2 and 177 DF, p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(mlr.quad, main="",which = 1:4)
The quadratic regression model shows that, for each unit increase in time_train_transportation, the index increases by approximately 0.118 units, holding the other term constant. However, the coefficient for time_sqft_train_transportation is not statistically significant (p = 0.488), suggesting the quadratic term adds little. The overall model explains approximately 63.44% of the variability (adjusted R-squared: 63.03%), with a statistically significant F-statistic (p < 2.2e-16). However, the residual plots do not look good, as there are large spikes in the residuals away from the center and the residuals are not normal.
time_cubic_train_transportation= time_train_transportation^3/factorial(3)
mlr.cubic=lm( train_transportation ~ time_train_transportation+time_sqft_train_transportation+time_cubic_train_transportation )
summary(mlr.cubic)
##
## Call:
## lm(formula = train_transportation ~ time_train_transportation +
## time_sqft_train_transportation + time_cubic_train_transportation)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.3332 -2.3110 0.5942 2.8546 7.9898
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.345e+01 1.084e+00 86.223 < 2e-16 ***
## time_train_transportation 4.663e-01 5.171e-02 9.017 3.22e-16 ***
## time_sqft_train_transportation -9.778e-03 1.326e-03 -7.375 6.19e-12 ***
## time_cubic_train_transportation 1.061e-04 1.445e-05 7.346 7.34e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.56 on 176 degrees of freedom
## Multiple R-squared: 0.7202, Adjusted R-squared: 0.7154
## F-statistic: 151 on 3 and 176 DF, p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(mlr.cubic, main="",which = 1:4)
The cubic regression model, with time_train_transportation, time_sqft_train_transportation, and time_cubic_train_transportation as predictors, indicates that for each unit increase in time_train_transportation the index increases by approximately 0.466 units, holding the other terms constant. Additionally, time_sqft_train_transportation has a statistically significant negative coefficient and time_cubic_train_transportation a statistically significant positive one. The overall model explains approximately 72.02% of the variability (adjusted R-squared: 71.54%), with a significant F-statistic (p < 2.2e-16). However, the residual plots still do not look good, as there are large spikes away from the center and the residuals are not normal.
time_four_train_transportation= time_train_transportation^4/factorial(4)
mlr.four=lm( train_transportation ~ time_train_transportation+time_sqft_train_transportation+time_cubic_train_transportation+time_four_train_transportation )
summary(mlr.four)
##
## Call:
## lm(formula = train_transportation ~ time_train_transportation +
## time_sqft_train_transportation + time_cubic_train_transportation +
## time_four_train_transportation)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.085 -1.595 1.072 2.263 5.097
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.849e+01 1.228e+00 80.230 < 2e-16 ***
## time_train_transportation -7.737e-02 9.343e-02 -0.828 0.409
## time_sqft_train_transportation 1.708e-02 4.182e-03 4.085 6.70e-05 ***
## time_cubic_train_transportation -5.851e-04 1.040e-04 -5.626 7.19e-08 ***
## time_four_train_transportation 7.638e-06 1.140e-06 6.698 2.77e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.185 on 175 degrees of freedom
## Multiple R-squared: 0.7773, Adjusted R-squared: 0.7722
## F-statistic: 152.7 on 4 and 175 DF, p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(mlr.four, main="",which = 1:4)
The fourth-degree regression model, with time_train_transportation, time_sqft_train_transportation, time_cubic_train_transportation, and time_four_train_transportation as predictors, indicates that the squared, cubic, and fourth-degree terms are statistically significant, while the linear term is not (p = 0.409). Holding the other terms constant, each unit increase in time_sqft_train_transportation is associated with an increase of approximately 0.01708, and each unit increase in time_cubic_train_transportation with a decrease of approximately 0.0005851. However, the residual plots still do not look good, as there are large spikes away from the center and the residuals are not normal.
Following are the plots of the fitted values from the linear, quadratic, and cubic polynomial models against the training series.
par(mfrow=c(2,2))
ts.plot(train_transportation) # Time Series Plot
# Plot of xfit vs mlr.lin$fitted
plin=cbind(train_transportation,mlr.lin$fitted)
ts.plot(plin,main="xfit and fit.linear")
pquad=cbind(train_transportation,mlr.quad$fitted)
ts.plot(pquad,main="xfit and fit.quadratic")
pcub=cbind(train_transportation,mlr.cubic$fitted)
ts.plot(pcub,main="xfit and fit.cubic")
# Create a dataframe to store the model summary information
model_summary <- data.frame(
Model = c("Linear", "Quadratic", "Cubic", "4th Degree"),
Multiple_R_squared = c(0.6334, 0.6344, 0.7202, 0.7773),
Adjusted_R_squared = c(0.6313, 0.6303, 0.7154, 0.7722),
F_statistic = c(307.5, 153.6, 151, 152.7),
Residual_DF = c(178, 177, 176, 175),
Residual_standard_error = c(4.052, 4.057, 3.56, 3.185)
)
# Print the dataframe
print(model_summary)
## Model Multiple_R_squared Adjusted_R_squared F_statistic Residual_DF
## 1 Linear 0.6334 0.6313 307.5 178
## 2 Quadratic 0.6344 0.6303 153.6 177
## 3 Cubic 0.7202 0.7154 151.0 176
## 4 4th Degree 0.7773 0.7722 152.7 175
## Residual_standard_error
## 1 4.052
## 2 4.057
## 3 3.560
## 4 3.185
This table summarizes the performance of different regression models based on their multiple R-squared, adjusted R-squared, F-statistic, residual degrees of freedom, and residual standard error.
AIC.lin = AIC(mlr.lin)
AIC.quad = AIC(mlr.quad)
AIC.cubic = AIC(mlr.cubic)
print(AIC.lin)
## [1] 1018.483
print(AIC.quad)
## [1] 1019.993
print(AIC.cubic)
## [1] 973.8577
print("---------------")
## [1] "---------------"
BIC.lin = BIC(mlr.lin)
BIC.quad = BIC(mlr.quad)
BIC.cubic = BIC(mlr.cubic)
print(BIC.lin)
## [1] 1028.062
print(BIC.quad)
## [1] 1032.765
print(BIC.cubic)
## [1] 989.8224
The code calculates the Akaike Information Criterion (AIC) and Bayesian Information Criterion (BIC) for the linear, quadratic, and cubic regression models. Both AIC and BIC suggest that the cubic model provides the best fit among the three.
#linear
new <- data.frame(time_train_transportation = 205:288)
# use predict function
pfore.lin=predict(mlr.lin,new,se.fit = TRUE)
efore.lin=test_transportation-pfore.lin$fit
## Warning in test_transportation - pfore.lin$fit: longer object length is not a
## multiple of shorter object length
nfore=length(test_transportation)
me.lin=mean(efore.lin) # Mean Error
mpe.lin=100*(mean(efore.lin/test_transportation)) # Mean Percent Error
## Warning in efore.lin/test_transportation: longer object length is not a
## multiple of shorter object length
mse.lin=sum(efore.lin**2)/nfore # Mean Squared Error
mae.lin=mean(abs(efore.lin)) # Mean Absolute Error
mape.lin=100*(mean(abs((efore.lin)/test_transportation))) # Mean Absolute Percent Error
## Warning in (efore.lin)/test_transportation: longer object length is not a
## multiple of shorter object length
me.lin
## [1] -1.540922
mpe.lin
## [1] -1.25843
mse.lin
## [1] 31.23893
mae.lin
## [1] 2.455101
mape.lin
## [1] 2.000836
The linear regression model exhibits a slight negative bias, with an average error of approximately -1.54. The average percentage error relative to the actual values is approximately -1.26%. The model’s predictions have an average squared error of approximately 31.24 and an average absolute error of approximately 2.46. The average absolute percentage error relative to the actual values is approximately 2.00%.
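For reference, the five error measures computed above can be collected into one small helper function (a sketch; the name `forecast_metrics` is my own). It assumes `actual` and `predicted` have equal length, which also avoids the vector-recycling warnings seen above:

```r
# Compute forecast-accuracy measures from actual and predicted vectors.
# Both vectors must have the same length.
forecast_metrics <- function(actual, predicted) {
  e <- actual - predicted
  c(ME   = mean(e),                      # Mean Error (bias)
    MPE  = 100 * mean(e / actual),       # Mean Percent Error
    MSE  = mean(e^2),                    # Mean Squared Error
    MAE  = mean(abs(e)),                 # Mean Absolute Error
    MAPE = 100 * mean(abs(e / actual)))  # Mean Absolute Percent Error
}
```
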
#quadratic
matq=matrix(c(time_train_transportation,time_sqft_train_transportation),nrow=84,ncol=2,dimnames = list(c(),c("time_train_transportation","time_sqft_train_transportation")))
## Warning in matrix(c(time_train_transportation, time_sqft_train_transportation),
## : data length [360] is not a sub-multiple or multiple of the number of rows
## [84]
matq <- data.frame(matq)
pfore.quad=predict(mlr.quad,matq,se.fit = TRUE)
efore.quad=test_transportation-pfore.quad$fit
## Warning in test_transportation - pfore.quad$fit: longer object length is not a
## multiple of shorter object length
nfore=length(test_transportation)
me.quad=mean(efore.quad) # Mean Error
mpe.quad=100*(mean(efore.quad/test_transportation)) # Mean Percent Error
## Warning in efore.quad/test_transportation: longer object length is not a
## multiple of shorter object length
mse.quad=sum(efore.quad**2)/nfore # Mean Squared Error
mae.quad=mean(abs(efore.quad)) # Mean Absolute Error
mape.quad=100*(mean(abs((efore.quad)/test_transportation))) # Mean Absolute Percent Error
## Warning in (efore.quad)/test_transportation: longer object length is not a
## multiple of shorter object length
me.quad
## [1] 19.08148
mpe.quad
## [1] 15.53051
mse.quad
## [1] 1304.153
mae.quad
## [1] 19.08148
mape.quad
## [1] 15.53051
For the quadratic regression model, the average error is approximately 19.08 units. The average percentage error relative to the actual values is approximately 15.53%. The model’s predictions have an average squared error of approximately 1304.15 and an average absolute error of approximately 19.08 units. The average absolute percentage error relative to the actual values is approximately 15.53%.
#cubic
matq=matrix(c(time_train_transportation,time_sqft_train_transportation, time_cubic_train_transportation),nrow=84,ncol=3,dimnames = list(c(),c("time_train_transportation","time_sqft_train_transportation", "time_cubic_train_transportation")))
## Warning in matrix(c(time_train_transportation, time_sqft_train_transportation,
## : data length [540] is not a sub-multiple or multiple of the number of rows
## [84]
matq <- data.frame(matq)
pfore.cubic=predict(mlr.cubic,matq,se.fit = TRUE)
efore.cubic=test_transportation-pfore.cubic$fit
## Warning in test_transportation - pfore.cubic$fit: longer object length is not a
## multiple of shorter object length
nfore=length(test_transportation)
me.cubic=mean(efore.cubic) # Mean Error
mpe.cubic=100*(mean(efore.cubic/test_transportation)) # Mean Percent Error
## Warning in efore.cubic/test_transportation: longer object length is not a
## multiple of shorter object length
mse.cubic=sum(efore.cubic**2)/nfore # Mean Squared Error
mae.cubic=mean(abs(efore.cubic)) # Mean Absolute Error
mape.cubic=100*(mean(abs((efore.cubic)/test_transportation))) # Mean Absolute Percent Error
## Warning in (efore.cubic)/test_transportation: longer object length is not a
## multiple of shorter object length
me.cubic
## [1] 10.72762
mpe.cubic
## [1] 8.733014
mse.cubic
## [1] 835.2739
mae.cubic
## [1] 12.64123
mape.cubic
## [1] 10.29451
For the cubic regression model, the average error is approximately 10.73 units. The average percentage error relative to the actual values is approximately 8.73%. The model’s predictions have an average squared error of approximately 835.27 and an average absolute error of approximately 12.64 units. The average absolute percentage error relative to the actual values is approximately 10.29%.
library(dplyr)
# Create data frames for AIC and BIC values
aic_bic <- data.frame(
Model = c("Linear", "Quadratic", "Cubic"),
AIC = c(1018.483, 1028.062, 973.8577),
BIC = c(1019.993, 1032.765, 989.8224)
)
# Create data frames for model metrics
linear_metrics <- data.frame(
Model = "Linear",
ME = -1.540922,
MPE = -1.25843,
MSE = 31.23893,
MAE = 2.455101,
MAPE = 2.000836
)
quadratic_metrics <- data.frame(
Model = "Quadratic",
ME = 19.08148,
MPE = 15.53051,
MSE = 1304.153,
MAE = 19.08148,
MAPE = 15.53051
)
cubic_metrics <- data.frame(
Model = "Cubic",
ME = 10.72762,
MPE = 8.733014,
MSE = 835.2739,
MAE = 12.64123,
MAPE = 10.29451
)
# Combine all data frames into a single table
combined_table <- bind_rows(aic_bic, linear_metrics, quadratic_metrics, cubic_metrics)
# Print the combined table
print(combined_table)
## Model AIC BIC ME MPE MSE MAE
## 1 Linear 1018.4830 1019.9930 NA NA NA NA
## 2 Quadratic 1028.0620 1032.7650 NA NA NA NA
## 3 Cubic 973.8577 989.8224 NA NA NA NA
## 4 Linear NA NA -1.540922 -1.258430 31.23893 2.455101
## 5 Quadratic NA NA 19.081480 15.530510 1304.15300 19.081480
## 6 Cubic NA NA 10.727620 8.733014 835.27390 12.641230
## MAPE
## 1 NA
## 2 NA
## 3 NA
## 4 2.000836
## 5 15.530510
## 6 10.294510
Based on the provided table, the cubic model appears to be the best choice among the three models (linear, quadratic, and cubic). It has the lowest AIC and BIC values, indicating better fit compared to the other models. Additionally, it has lower error metrics such as mean absolute error (MAE) and mean squared error (MSE), suggesting better predictive performance.
Following is the code to fit monthly seasonal indicators.
per=12
sets=length(time_train_transportation)/per
month=factor(rep(1:per,sets))
time_cubic_train_transportation= time_train_transportation^3/factorial(3)
mlr.cubic.month=lm( train_transportation ~ time_train_transportation+time_sqft_train_transportation+time_cubic_train_transportation +month )
summary(mlr.cubic.month)
##
## Call:
## lm(formula = train_transportation ~ time_train_transportation +
## time_sqft_train_transportation + time_cubic_train_transportation +
## month)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.4747 -1.9656 0.5455 2.9171 7.7183
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.372e+01 1.408e+00 66.559 < 2e-16 ***
## time_train_transportation 4.713e-01 5.350e-02 8.809 1.67e-15 ***
## time_sqft_train_transportation -9.907e-03 1.372e-03 -7.222 1.79e-11 ***
## time_cubic_train_transportation 1.075e-04 1.495e-05 7.194 2.10e-11 ***
## month2 1.848e-01 1.338e+00 0.138 0.890
## month3 -9.648e-02 1.338e+00 -0.072 0.943
## month4 -3.972e-01 1.338e+00 -0.297 0.767
## month5 -2.108e-01 1.339e+00 -0.157 0.875
## month6 -2.973e-01 1.339e+00 -0.222 0.825
## month7 -3.570e-01 1.339e+00 -0.267 0.790
## month8 -5.233e-01 1.340e+00 -0.391 0.697
## month9 -7.162e-01 1.340e+00 -0.534 0.594
## month10 -7.825e-01 1.341e+00 -0.584 0.560
## month11 -5.690e-01 1.342e+00 -0.424 0.672
## month12 -6.492e-01 1.342e+00 -0.484 0.629
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.664 on 165 degrees of freedom
## Multiple R-squared: 0.722, Adjusted R-squared: 0.6984
## F-statistic: 30.61 on 14 and 165 DF, p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(mlr.cubic.month, main="",which = 1:4)
From the summary above, we can observe that the monthly indicators are not significant, suggesting no monthly seasonal pattern in the series.
per=12
sets=length(time_train_transportation)/per
month=factor(rep(1:per,sets))
mlr.quad.month=lm( train_transportation ~ time_train_transportation+time_sqft_train_transportation +month )
summary(mlr.quad.month)
##
## Call:
## lm(formula = train_transportation ~ time_train_transportation +
## time_sqft_train_transportation + month)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.1380 -2.2869 0.1722 3.4551 6.6203
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 98.8207233 1.3895754 71.116 < 2e-16 ***
## time_train_transportation 0.1177538 0.0241561 4.875 2.53e-06 ***
## time_sqft_train_transportation -0.0001741 0.0002585 -0.674 0.501
## month2 0.2438019 1.5289974 0.159 0.874
## month3 0.0211113 1.5290342 0.014 0.989
## month4 -0.2214052 1.5290948 -0.145 0.885
## month5 0.0229192 1.5291789 0.015 0.988
## month6 -0.0059157 1.5292862 -0.004 0.997
## month7 -0.0079097 1.5294165 -0.005 0.996
## month8 -0.1163963 1.5295699 -0.076 0.939
## month9 -0.2513754 1.5297464 -0.164 0.870
## month10 -0.2595137 1.5299464 -0.170 0.866
## month11 0.0125222 1.5301701 0.008 0.993
## month12 -0.0086012 1.5304182 -0.006 0.996
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.187 on 166 degrees of freedom
## Multiple R-squared: 0.6348, Adjusted R-squared: 0.6062
## F-statistic: 22.2 on 13 and 166 DF, p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(mlr.quad.month, main="",which = 1:4)
From the summary above, we can observe that the monthly indicators are again not significant, and the residual plots do not look good.
per=12
sets=length(time_train_transportation)/per
month=factor(rep(1:per,sets))
mlr.lin.month=lm( train_transportation ~ time_train_transportation +month )
summary(mlr.lin.month)
##
## Call:
## lm(formula = train_transportation ~ time_train_transportation +
## month)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.9479 -2.3026 0.0208 3.6013 6.7812
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 99.297173 1.194169 83.152 <2e-16 ***
## time_train_transportation 0.101994 0.006010 16.971 <2e-16 ***
## month2 0.244673 1.526495 0.160 0.873
## month3 0.022679 1.526530 0.015 0.988
## month4 -0.219315 1.526589 -0.144 0.886
## month5 0.025357 1.526672 0.017 0.987
## month6 -0.003304 1.526778 -0.002 0.998
## month7 -0.005298 1.526909 -0.003 0.997
## month8 -0.113958 1.527062 -0.075 0.941
## month9 -0.249286 1.527240 -0.163 0.871
## month10 -0.257946 1.527441 -0.169 0.866
## month11 0.013393 1.527665 0.009 0.993
## month12 -0.008601 1.527914 -0.006 0.996
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.18 on 167 degrees of freedom
## Multiple R-squared: 0.6338, Adjusted R-squared: 0.6075
## F-statistic: 24.09 on 12 and 167 DF, p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(mlr.lin.month, main="",which = 1:4)
From the summary above, we can observe that the monthly indicators are again not significant, and the residual plots do not look good.
Following is the code to fit cosine and sine trigonometric (harmonic) terms.
t=1:length(train_transportation)
per=12
# Model 1
c1=cos(2*pi*t/per)
s1=sin(2*pi*t/per)
accd_trig1=lm(time_train_transportation~c1+s1)
summary(accd_trig1)
##
## Call:
## lm(formula = time_train_transportation ~ c1 + s1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -88.50 -46.77 0.00 46.77 88.50
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 90.500 3.900 23.204 <2e-16 ***
## c1 1.000 5.516 0.181 0.856
## s1 -3.732 5.516 -0.677 0.500
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 52.33 on 177 degrees of freedom
## Multiple R-squared: 0.002765, Adjusted R-squared: -0.008504
## F-statistic: 0.2453 on 2 and 177 DF, p-value: 0.7827
c2=cos(2*pi*2*t/per)
s2=sin(2*pi*2*t/per)
accd_trig2=lm(time_train_transportation~c1+s1+c2+s2)
summary(accd_trig2)
##
## Call:
## lm(formula = time_train_transportation ~ c1 + s1 + c2 + s2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -87.5 -47.3 0.0 47.3 87.5
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 90.500 3.921 23.081 <2e-16 ***
## c1 1.000 5.545 0.180 0.857
## s1 -3.732 5.545 -0.673 0.502
## c2 1.000 5.545 0.180 0.857
## s2 -1.732 5.545 -0.312 0.755
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 52.61 on 175 degrees of freedom
## Multiple R-squared: 0.003505, Adjusted R-squared: -0.01927
## F-statistic: 0.1539 on 4 and 175 DF, p-value: 0.961
From both summaries above, we can observe that the harmonic terms c1, s1, c2, and s2 have no significant effect on the data.
Following is the code to fit autoregressive integrated moving average (ARIMA) models and identify the best order for the data.
library(forecast)
library(astsa)
##
## Attaching package: 'astsa'
## The following object is masked from 'package:forecast':
##
## gas
auto.arima(train_transportation,ic="bic")
## Series: train_transportation
## ARIMA(0,1,0)
##
## sigma^2 = 1.431: log likelihood = -286.05
## AIC=574.1 AICc=574.12 BIC=577.29
An ARIMA(0,1,0) model was selected for the train_transportation series based on BIC. The estimated residual variance (sigma^2) is 1.431, and the log likelihood of the model is -286.05.
auto.arima(train_transportation,ic="aic")
## Series: train_transportation
## ARIMA(3,1,2) with drift
##
## Coefficients:
## ar1 ar2 ar3 ma1 ma2 drift
## -0.4840 -0.8287 0.0163 0.3438 0.8708 0.111
## s.e. 0.1138 0.1054 0.0874 0.0862 0.0929 0.082
##
## sigma^2 = 1.34: log likelihood = -277.47
## AIC=568.94 AICc=569.6 BIC=591.25
The auto.arima function identified an ARIMA(3,1,2) model with drift as the best-fitting model for the “train_transportation” series based on the Akaike Information Criterion (AIC). This model includes three autoregressive terms (AR) at lag 1, 2, and 3, and two moving average terms (MA) at lag 1 and 2, along with a drift term. The estimated coefficients for the AR and MA terms are provided, along with their standard errors. The estimated variance of the residuals (sigma^2) is 1.34, and the log likelihood of the model is -277.47. Lower values of AIC, AICc, and BIC indicate better model fit.
d <- 1 # order of differencing (first difference)
np <- 3 # maximum AR order
nq <- 3 # maximum MA order
outarima.all <- matrix(nrow=(np+1)*(nq+1), ncol=3)
colnames(outarima.all) <- c("p", "q", "BIC")
# Check for missing values
if (any(is.na(train_transportation))) {
stop("Input data contains missing values.")
}
for (p in 0:np) {
for (q in 0:nq) {
# Try fitting SARIMA model
tryCatch({
gnpgr.fit <- sarima(train_transportation, p, d, q, details = FALSE)
outarima.all[p*(nq+1) + q + 1, ] <- c(p, q, gnpgr.fit$BIC)
}, error = function(e) {
cat("Error occurred for p =", p, "and q =", q, ":", conditionMessage(e), "\n")
})
}
}
outarima.all
## p q BIC
## [1,] 0 0 3.244849
## [2,] 0 1 3.250873
## [3,] 0 2 3.272257
## [4,] 0 3 3.274025
## [5,] 1 0 3.248966
## [6,] 1 1 3.277943
## [7,] 1 2 3.278535
## [8,] 1 3 3.302465
## [9,] 2 0 3.277938
## [10,] 2 1 3.291620
## [11,] 2 2 3.274307
## [12,] 2 3 3.328085
## [13,] 3 0 3.270690
## [14,] 3 1 3.295343
## [15,] 3 2 3.303092
## [16,] 3 3 3.326566
outarima.all[which.min(outarima.all[,3]),]
## p q BIC
## 0.000000 0.000000 3.244849
The code snippet performs a grid search to identify the best-fitting SARIMA model for the “train_transportation” series based on the Bayesian Information Criterion (BIC). It iterates over different combinations of AR (p) and MA (q) orders, ranging from 0 to 3. For each combination, it attempts to fit a SARIMA model and stores the resulting BIC value in a matrix. If an error occurs during model fitting, it handles the error and continues the iteration. After completing the grid search, it selects the model with the lowest BIC value as the best-fitting model, which corresponds to an ARIMA(0,1,0) model: first differencing (d = 1) with no autoregressive or moving average terms (p = 0 and q = 0).
freight_auto_arima = arima(train_transportation,order=c(0,1,0), method='CSS',
include.mean=TRUE)
summary(freight_auto_arima)
##
## Call:
## arima(x = train_transportation, order = c(0, 1, 0), include.mean = TRUE, method = "CSS")
##
##
## sigma^2 estimated as 1.431: part log likelihood = -286.05
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set 0.1138889 1.192826 0.885 0.09539694 0.8264137 0.9944444 -0.1568116
ar1.resid=na.omit(freight_auto_arima$resid)
ar1.resid
## Time Series:
## Start = 1
## End = 180
## Frequency = 1
## [1] 0.0 -0.4 -1.9 -1.2 1.5 0.2 -1.8 1.3 1.3 -0.6 0.3 -1.8 1.2 -0.1 0.2
## [16] -1.0 1.1 -0.7 -0.7 1.3 -6.3 1.3 0.3 0.4 1.0 1.5 -0.9 0.7 0.9 0.4
## [31] 0.7 0.0 0.4 0.9 0.7 0.1 -0.6 -0.5 -0.1 -0.9 -0.3 0.3 1.6 -0.3 1.2
## [46] 1.2 0.2 2.1 -0.8 0.9 2.0 0.3 -0.1 0.8 -0.1 -0.6 0.4 0.9 0.5 0.2
## [61] 1.8 -0.5 -0.2 0.5 -0.3 -0.5 0.1 0.3 -0.5 -0.7 2.2 -1.2 0.6 -0.6 0.3
## [76] -0.3 1.6 -0.3 -1.0 -1.9 2.8 -1.6 -1.2 1.6 -0.3 0.3 1.3 -1.1 0.3 -0.4
## [91] -0.7 0.7 0.1 1.2 -0.2 -0.2 2.0 -1.6 -0.9 1.4 -0.1 -1.3 1.0 -1.8 -2.1
## [106] 0.0 -2.6 -2.8 -1.8 2.3 -4.5 -0.1 -0.7 1.3 1.8 0.9 0.1 -0.9 1.4 1.0
## [121] 1.0 1.0 0.9 0.1 0.6 0.3 0.3 0.1 0.6 0.7 -0.4 0.5 1.2 -0.6 1.2
## [136] -0.5 -0.7 1.0 0.1 -0.4 1.0 0.4 0.3 2.0 -2.4 0.7 -0.3 0.0 0.5 0.5
## [151] -0.1 -0.3 -0.4 -1.8 1.6 0.4 1.9 1.2 -0.7 -0.3 0.5 0.7 -0.4 0.4 0.0
## [166] 0.1 2.0 -1.3 -1.0 1.6 1.8 0.3 0.4 -1.2 0.7 0.2 0.9 0.3 0.5 0.2
par(mfrow=c(2,2))
ts.plot(ar1.resid,main="Residuals from ARIMA(0,1,0) fit" )
hist(ar1.resid,main="histogram",xlab="resid")
qqnorm(ar1.resid,main="Normal Q-Q plot",xlab="resid"); qqline(ar1.resid)
shapiro.test(ar1.resid)
##
## Shapiro-Wilk normality test
##
## data: ar1.resid
## W = 0.93338, p-value = 2.256e-07
acf(as.numeric(ar1.resid), lag.max=40, main="")
Box.test(ar1.resid, lag=4, fitdf=1)
##
## Box-Pierce test
##
## data: ar1.resid
## X-squared = 10.329, df = 3, p-value = 0.01596
Box.test(ar1.resid, lag=8, fitdf=1)
##
## Box-Pierce test
##
## data: ar1.resid
## X-squared = 16.103, df = 7, p-value = 0.02419
Box.test(ar1.resid, lag=4, type="Ljung", fitdf=1)
##
## Box-Ljung test
##
## data: ar1.resid
## X-squared = 10.569, df = 3, p-value = 0.0143
The Shapiro-Wilk p-value (2.256e-07) indicates a significant departure from normality in the residuals.
The Box-Pierce and Box-Ljung tests evaluate the autocorrelation of the residuals. Both tests yield p-values below conventional significance levels (0.05), suggesting that there is significant autocorrelation present in the residuals. This indicates that the residuals are not independent, violating one of the assumptions of the ARIMA model.
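The plots and tests above can also be produced in a single call with the forecast package's checkresiduals() function. The sketch below is self-contained: a synthetic random walk stands in for the series, since in the report this would be applied to the freight_auto_arima fit; the lag choice of 8 is illustrative.

```r
# Sketch: one-call residual diagnostics with forecast::checkresiduals().
# Assumption: a synthetic random walk stands in for train_transportation so
# the snippet runs on its own; in the report, apply it to freight_auto_arima.
library(forecast)
set.seed(1)
y   <- ts(cumsum(rnorm(180)))        # synthetic stand-in series
fit <- Arima(y, order = c(0, 1, 0))  # same order as freight_auto_arima
checkresiduals(fit, lag = 8)         # time plot, ACF, histogram + Ljung-Box test
```

checkresiduals() reports the Ljung-Box statistic directly, so the separate Box.test() calls become optional cross-checks.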
rec.yw.fore <- predict(freight_auto_arima, n.ahead=24)
rec.yw.fore
## $pred
## Time Series:
## Start = 181
## End = 204
## Frequency = 1
## [1] 122.4 122.4 122.4 122.4 122.4 122.4 122.4 122.4 122.4 122.4 122.4 122.4
## [13] 122.4 122.4 122.4 122.4 122.4 122.4 122.4 122.4 122.4 122.4 122.4 122.4
##
## $se
## Time Series:
## Start = 181
## End = 204
## Frequency = 1
## [1] 1.196153 1.691616 2.071798 2.392306 2.674680 2.929965 3.164724 3.383232
## [9] 3.588459 3.782568 3.967191 4.143596 4.312791 4.475595 4.632681 4.784612
## [17] 4.931865 5.074848 5.213910 5.349359 5.481462 5.610455 5.736549 5.859929
U.yw <- rec.yw.fore$pred + rec.yw.fore$se
L.yw <- rec.yw.fore$pred - rec.yw.fore$se
month <- 181:204
plot(month,ts_transportation_filtered[month],type="o",xlim=c(181, 204),ylab="TSI",
ylim=c(min(L.yw), max(U.yw))) # observed test data
lines(rec.yw.fore$pred,col="red",type="o") # point forecasts
lines(U.yw,col="blue",lty="dashed") # upper limit
lines(L.yw,col="blue",lty="dashed") # lower limit
efore.freight_auto_arima=test_transportation - rec.yw.fore$pred
nfore=length(test_transportation)
me.freight_auto_arima=mean(efore.freight_auto_arima) # Mean Error
mpe.freight_auto_arima=100*(mean(efore.freight_auto_arima/test_transportation)) # Mean Percent Error
mse.freight_auto_arima=sum(efore.freight_auto_arima**2)/nfore # Mean Squared Error
mae.freight_auto_arima=mean(abs(efore.freight_auto_arima)) # Mean Absolute Error
mape.freight_auto_arima=100*(mean(abs((efore.freight_auto_arima)/test_transportation))) # Mean Absolute Percent Error
me.freight_auto_arima
## [1] 0.5041667
mpe.freight_auto_arima
## [1] 0.405107
mse.freight_auto_arima
## [1] 1.029583
mae.freight_auto_arima
## [1] 0.8458333
mape.freight_auto_arima
## [1] 0.6856508
The two-year (24-month) point forecast from ARIMA(0,1,0) is a flat line at the last observed value, so it is close to the test data at some points but cannot track the month-to-month variation.
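The five error measures computed by hand above can also be obtained with the forecast package's accuracy() function. A minimal sketch, assuming rec.yw.fore and test_transportation exist as defined in the code above:

```r
# Sketch: forecast::accuracy() returns ME, RMSE, MAE, MPE and MAPE in one call.
# Assumption: rec.yw.fore and test_transportation are the objects created above.
library(forecast)
accuracy(as.numeric(rec.yw.fore$pred), as.numeric(test_transportation))
```

Note accuracy() reports RMSE rather than MSE; squaring the RMSE recovers the MSE used elsewhere in this report.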
Following is the code to fit an ARIMA(0,1,1) model and interpret the results.
freight_prof = arima(train_transportation,order=c(0,1,1), method='CSS',
include.mean=TRUE)
summary(freight_prof)
##
## Call:
## arima(x = train_transportation, order = c(0, 1, 1), include.mean = TRUE, method = "CSS")
##
## Coefficients:
## ma1
## -0.1296
## s.e. 0.0669
##
## sigma^2 estimated as 1.403: part log likelihood = -284.3
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE
## Training set 0.1306169 1.181218 0.8805218 0.1096842 0.8223829 0.9894125
## ACF1
## Training set -0.02390081
ar1.resid=na.omit(freight_prof$resid)
ar1.resid
## Time Series:
## Start = 1
## End = 180
## Frequency = 1
## [1] 0.000000000 -0.400000000 -1.951826854 -1.452892614 1.311752867
## [6] 0.369960061 -1.752065335 1.072989914 1.439024229 -0.413549753
## [11] 0.246417543 -1.768072385 0.970915927 0.025798795 0.203342676
## [16] -0.973653472 0.973846509 -0.573821498 -0.774348408 1.199669895
## [21] -6.144562209 0.503866679 0.365284562 0.447328874 1.057959121
## [26] 1.637076732 -0.687888658 0.610872237 0.979148966 0.526865526
## [31] 0.768264457 0.099541825 0.412897349 0.953497927 0.823541995
## [36] 0.206703977 -0.573217958 -0.574270209 -0.174406546 -0.922597356
## [41] -0.419538296 0.245641625 1.631827082 -0.088568840 1.188524389
## [46] 1.353993700 0.375433085 2.148643789 -0.521606380 0.832416956
## [51] 2.107853880 0.573108588 -0.025743962 0.796664429 0.003221528
## [56] -0.599582596 0.322313801 0.941761276 0.622021310 0.280593519
## [61] 1.836355698 -0.262068653 -0.233955485 0.469687058 -0.239143994
## [66] -0.530985202 0.031201769 0.304042724 -0.460606055 -0.759679407
## [71] 2.101570516 -0.927705529 0.479799852 -0.537833708 0.230314427
## [76] -0.270158819 1.564996296 -0.097227914 -1.012597542 -2.031199362
## [81] 2.536823318 -1.271311071 -1.364720133 1.423177122 -0.115603018
## [86] 0.285021648 1.336929438 -0.926777883 0.179920045 -0.376688275
## [91] -0.748806421 0.602979297 0.178126300 1.223079314 -0.041529117
## [96] -0.205380809 1.973389397 -1.344313590 -1.074178860 1.260821723
## [101] 0.063361058 -1.291790489 0.832626407 -1.692118982 -2.319243009
## [106] -0.300497672 -2.638934622 -3.141919199 -2.207089469 2.014033741
## [111] -4.239047418 -0.649241229 -0.784120326 1.198403776 1.955273744
## [116] 1.153339217 0.249434858 -0.867681440 1.287577002 1.166827663
## [121] 1.151182517 1.149155421 1.048892776 0.235902032 0.630565150
## [126] 0.381700520 0.349455843 0.145277992 0.618823253 0.780179156
## [131] -0.298914422 0.461270515 1.259765499 -0.436775793 1.143408212
## [136] -0.351851874 -0.745588439 0.903396242 0.217050463 -0.371877393
## [141] 0.951816912 0.523324190 0.367805616 2.047655520 -2.134691141
## [146] 0.423414185 -0.245139437 -0.031762015 0.495884687 0.564250358
## [151] -0.026891698 -0.303484280 -0.439321589 -1.856921640 1.359403983
## [156] 0.576134079 1.974648042 1.455849490 -0.511369753 -0.366256714
## [161] 0.452545167 0.758634981 -0.301705839 0.360908839 0.046761924
## [166] 0.106058809 2.013741736 -1.039085253 -1.134631299 1.452989073
## [171] 1.988259631 0.557613104 0.472248332 -1.138812137 0.552447374
## [176] 0.271579024 0.935187716 0.421169593 0.554569738 0.271854012
par(mfrow=c(2,2))
ts.plot(ar1.resid,main="Residuals from ARIMA(0,1,1) fit" )
hist(ar1.resid,main="histogram",xlab="resid")
qqnorm(ar1.resid,main="Normal Q-Q plot",xlab="resid"); qqline(ar1.resid)
shapiro.test(ar1.resid)
##
## Shapiro-Wilk normality test
##
## data: ar1.resid
## W = 0.93095, p-value = 1.462e-07
The Shapiro-Wilk test (W = 0.93095, p-value = 1.462e-07) shows that the residuals are not normally distributed.
acf(as.numeric(ar1.resid), lag.max=40, main="")
Box.test(ar1.resid, lag=4, fitdf=1)
##
## Box-Pierce test
##
## data: ar1.resid
## X-squared = 7.1294, df = 3, p-value = 0.06788
Box.test(ar1.resid, lag=8, fitdf=1)
##
## Box-Pierce test
##
## data: ar1.resid
## X-squared = 12.407, df = 7, p-value = 0.08795
Box.test(ar1.resid, lag=4, type="Ljung", fitdf=1)
##
## Box-Ljung test
##
## data: ar1.resid
## X-squared = 7.3277, df = 3, p-value = 0.06216
The Box-Pierce and Box-Ljung tests assess the autocorrelation of the residuals from the ARIMA(0,1,1) model. The p-values obtained are 0.06788, 0.08795, and 0.06216, respectively, all above the conventional significance level of 0.05. Therefore, we fail to reject the null hypothesis that there is no autocorrelation in the residuals. This suggests that the residuals are relatively independent, indicating that the ARIMA(0,1,1) model adequately captures the temporal dependence in the data.
rec.yw.fore <- predict(freight_prof, n.ahead=24)
rec.yw.fore
## $pred
## Time Series:
## Start = 181
## End = 204
## Frequency = 1
## [1] 122.3648 122.3648 122.3648 122.3648 122.3648 122.3648 122.3648 122.3648
## [9] 122.3648 122.3648 122.3648 122.3648 122.3648 122.3648 122.3648 122.3648
## [17] 122.3648 122.3648 122.3648 122.3648 122.3648 122.3648 122.3648 122.3648
##
## $se
## Time Series:
## Start = 181
## End = 204
## Frequency = 1
## [1] 1.184513 1.570386 1.878604 2.142940 2.378073 2.591963 2.789501 2.973946
## [9] 3.147602 3.312165 3.468930 3.618911 3.762919 3.901615 4.035547 4.165175
## [17] 4.290888 4.413022 4.531866 4.647671 4.760660 4.871029 4.978953 5.084585
U.yw <- rec.yw.fore$pred + rec.yw.fore$se
L.yw <- rec.yw.fore$pred - rec.yw.fore$se
month <- 181:204
plot(month,ts_transportation_filtered[month],type="o",xlim=c(181, 204),ylab="TSI",
ylim=c(min(L.yw), max(U.yw))) # observed test data
lines(rec.yw.fore$pred,col="red",type="o") # point forecasts
lines(U.yw,col="blue",lty="dashed") # upper limit
lines(L.yw,col="blue",lty="dashed") # lower limit
efore.freight_prof=test_transportation - rec.yw.fore$pred
nfore=length(test_transportation)
me.freight_prof=mean(efore.freight_prof) # Mean Error
mpe.freight_prof=100*(mean(efore.freight_prof/test_transportation)) # Mean Percent Error
mse.freight_prof=sum(efore.freight_prof**2)/nfore # Mean Squared Error
mae.freight_prof=mean(abs(efore.freight_prof)) # Mean Absolute Error
mape.freight_prof=100*(mean(abs((efore.freight_prof)/test_transportation))) # Mean Absolute Percent Error
me.freight_prof
## [1] 0.53939
mpe.freight_prof
## [1] 0.4337677
mse.freight_prof
## [1] 1.066341
mae.freight_prof
## [1] 0.8575744
mape.freight_prof
## [1] 0.6950459
As with the previous model, the two-year point forecasts are essentially flat, close to the test data at some points but unable to follow its variation.
freight_iterations = arima(train_transportation,order=c(3,1,2), method='CSS',
include.mean=TRUE)
freight_iterations
##
## Call:
## arima(x = train_transportation, order = c(3, 1, 2), include.mean = TRUE, method = "CSS")
##
## Coefficients:
## ar1 ar2 ar3 ma1 ma2
## -0.4059 -0.7420 0.0279 0.2854 0.8775
## s.e. 0.0825 0.0575 0.0699 0.0491 0.0515
##
## sigma^2 estimated as 1.246: part log likelihood = -273.65
ar1.resid=na.omit(freight_iterations$resid)
ar1.resid
## Time Series:
## Start = 1
## End = 180
## Frequency = 1
## [1] 0.000000000 0.000000000 0.000000000 0.000000000 -0.385774905
## [6] 0.081578850 -0.257081929 0.677667036 0.518692937 0.199867702
## [11] 0.472602255 -2.469955955 0.998857999 0.925426386 -0.040547356
## [16] -1.826979413 1.402236695 0.201909693 -1.428106395 0.696136323
## [21] -5.217688988 0.605002853 0.522574202 0.982149503 0.609859295
## [26] 1.158488803 -0.426018994 0.524799438 0.698544678 0.649992415
## [31] 0.712186872 -0.217766898 0.345457579 1.135340598 0.734994579
## [36] -0.165208196 -0.662913147 -0.354731373 -0.068025269 -0.964171348
## [41] -0.390733902 0.470762380 1.732802506 -0.327148122 0.829915066
## [46] 1.470091786 0.438112730 1.623095895 -0.680279207 0.897792997
## [51] 2.053855695 0.428041018 -0.443704891 0.677233456 0.338245302
## [56] -0.734988847 -0.027131056 1.272642262 0.839494812 -0.296697119
## [61] 1.575103602 0.175959321 -0.505276657 -0.012617945 0.215487841
## [66] -0.295615862 -0.444241989 0.364136349 -0.004170472 -1.001485696
## [71] 1.825942499 -0.454720156 0.292367435 -0.992664882 0.561873191
## [76] 0.070535604 1.204397634 -0.287102335 -0.901101997 -2.064081135
## [81] 2.674862870 -0.797478690 -1.838417455 1.071981919 0.810977477
## [86] 0.226828317 0.378173698 -0.648286861 0.662893909 -0.751019763
## [91] -0.976430994 1.048347675 0.433528442 0.735893564 -0.248647746
## [96] 0.031659403 1.946082718 -1.514131593 -1.335439177 1.501395511
## [101] 0.588489178 -1.762071617 0.345506058 -0.908298910 -2.096348148
## [106] -0.820679652 -2.034268548 -2.496129312 -2.368417134 2.430480126
## [111] -3.439223949 -1.321044300 -0.748926068 2.440140494 1.771885179
## [116] -0.032056638 0.219018999 -0.276192265 0.970390693 0.863117624
## [121] 1.372024790 0.959945567 0.542144663 0.182381744 0.752727049
## [126] 0.217791723 0.141529320 0.196140560 0.674660062 0.644737936
## [131] -0.449440638 0.402808853 1.366052807 -0.474032860 0.769490427
## [136] -0.295217384 -0.586778641 0.737872531 0.304784081 -0.332321713
## [141] 0.711323856 0.594940159 0.421576299 1.748316070 -2.245575179
## [146] 0.308150378 0.030049003 0.185617858 0.178525655 0.497506725
## [151] 0.175334222 -0.470132183 -0.629615697 -1.589964956 1.587131069
## [156] 0.667258868 1.716692680 1.147993021 -0.648227233 -0.569105252
## [161] 0.556559908 1.040441732 -0.521777101 -0.020984260 0.309878489
## [166] 0.337944613 1.661072622 -1.184533465 -1.165999510 1.545831765
## [171] 2.325739591 0.225604940 -0.292441853 -0.979755447 1.037545253
## [176] 0.146201560 0.581913935 0.499845989 0.630736482 -0.018156486
par(mfrow=c(2,2))
ts.plot(ar1.resid,main="Residuals from ARIMA(3,1,2) fit" )
hist(ar1.resid,main="histogram",xlab="resid")
qqnorm(ar1.resid,main="Normal Q-Q plot",xlab="resid"); qqline(ar1.resid)
shapiro.test(ar1.resid)
##
## Shapiro-Wilk normality test
##
## data: ar1.resid
## W = 0.95251, p-value = 9.697e-06
The Shapiro-Wilk test (p-value = 9.697e-06) again indicates that the residuals are not normally distributed.
acf(as.numeric(ar1.resid), lag.max=40, main="")
Box.test(ar1.resid, lag=4, fitdf=1)
##
## Box-Pierce test
##
## data: ar1.resid
## X-squared = 3.8477, df = 3, p-value = 0.2784
Box.test(ar1.resid, lag=8, fitdf=1)
##
## Box-Pierce test
##
## data: ar1.resid
## X-squared = 5.0103, df = 7, p-value = 0.6587
Box.test(ar1.resid, lag=4, type="Ljung", fitdf=1)
##
## Box-Ljung test
##
## data: ar1.resid
## X-squared = 3.9671, df = 3, p-value = 0.265
rec.yw.fore <- predict(freight_iterations, n.ahead=24)
rec.yw.fore
## $pred
## Time Series:
## Start = 181
## End = 204
## Frequency = 1
## [1] 122.5045 122.3117 122.3180 122.4614 122.3931 122.3146 122.4011 122.4224
## [9] 122.3474 122.3645 122.4138 122.3790 122.3570 122.3931 122.3938 122.3661
## [17] 122.3778 122.3936 122.3777 122.3728 122.3870 122.3845 122.3748 122.3810
##
## $se
## Time Series:
## Start = 181
## End = 204
## Frequency = 1
## [1] 1.116062 1.486268 1.902308 2.267909 2.502995 2.734415 2.992456 3.199795
## [9] 3.377991 3.574081 3.760161 3.918751 4.078887 4.242898 4.390985 4.531475
## [17] 4.675861 4.814405 4.943905 5.073395 5.201977 5.324056 5.443199 5.562339
U.yw <- rec.yw.fore$pred + rec.yw.fore$se
L.yw <- rec.yw.fore$pred - rec.yw.fore$se
month <- 181:204
plot(month,ts_transportation_filtered[month],type="o",xlim=c(181, 204),ylab="TSI",
ylim=c(min(L.yw), max(U.yw))) # observed test data
lines(rec.yw.fore$pred,col="red",type="o") # point forecasts
lines(U.yw,col="blue",lty="dashed") # upper limit
lines(L.yw,col="blue",lty="dashed") # lower limit
efore.freight_iterations=test_transportation - rec.yw.fore$pred
nfore=length(test_transportation)
me.freight_iterations = mean(efore.freight_iterations) # Mean Error
mpe.freight_iterations=100*(mean(efore.freight_iterations/test_transportation)) # Mean Percent Error
mse.freight_iterations=sum(efore.freight_iterations**2)/nfore # Mean Squared Error
mae.freight_iterations=mean(abs(efore.freight_iterations)) # Mean Absolute Error
mape.freight_iterations=100*(mean(abs((efore.freight_iterations)/test_transportation))) # Mean Absolute Percent Error
me.freight_iterations
## [1] 0.5212157
mpe.freight_iterations
## [1] 0.4190192
mse.freight_iterations
## [1] 1.03699
mae.freight_iterations
## [1] 0.8416413
mape.freight_iterations
## [1] 0.6820952
The two-year forecasts from ARIMA(3,1,2) track the test data somewhat more closely than the previous two models, though the improvement is modest.
# Create a data frame with the computed values
results <- data.frame(
Model = c("freight_3_1_2", "freight_0_1_1", "freight_0_1_0"),
ME = c(me.freight_iterations, me.freight_prof, me.freight_auto_arima),
MPE = c(mpe.freight_iterations, mpe.freight_prof, mpe.freight_auto_arima),
MSE = c(mse.freight_iterations, mse.freight_prof, mse.freight_auto_arima),
MAE = c(mae.freight_iterations, mae.freight_prof, mae.freight_auto_arima),
MAPE = c(mape.freight_iterations, mape.freight_prof, mape.freight_auto_arima)
)
# Print the table
print(results)
## Model ME MPE MSE MAE MAPE
## 1 freight_3_1_2 0.5212157 0.4190192 1.036990 0.8416413 0.6820952
## 2 freight_0_1_1 0.5393900 0.4337677 1.066341 0.8575744 0.6950459
## 3 freight_0_1_0 0.5041667 0.4051070 1.029583 0.8458333 0.6856508
Based on the table, the “freight_0_1_0” model has the lowest Mean Error (ME), Mean Percent Error (MPE), and Mean Squared Error (MSE), while “freight_3_1_2” has the lowest Mean Absolute Error (MAE) and Mean Absolute Percent Error (MAPE); the differences across the three models are small, so on out-of-sample accuracy the simple ARIMA(0,1,0) is competitive.
On the other hand, the ARIMA(3,1,2) model, with its three autoregressive (AR) and two moving average (MA) terms, captures more of the structure in the data: both it and ARIMA(0,1,1) pass the Box-Pierce and Box-Ljung residual autocorrelation tests that ARIMA(0,1,0) fails. Model selection should therefore weigh both forecast accuracy, which favors the simpler models, and residual diagnostics, which favor ARIMA(3,1,2).
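To make the metric-by-metric comparison explicit, the printed table can be scanned with which.min(). This sketch is self-contained, rebuilding the data frame from the values reported above:

```r
# Rebuild the comparison table from the values printed above, then pick the
# best (lowest-error) model for each metric.
results <- data.frame(
  Model = c("freight_3_1_2", "freight_0_1_1", "freight_0_1_0"),
  ME    = c(0.5212157, 0.5393900, 0.5041667),
  MPE   = c(0.4190192, 0.4337677, 0.4051070),
  MSE   = c(1.036990, 1.066341, 1.029583),
  MAE   = c(0.8416413, 0.8575744, 0.8458333),
  MAPE  = c(0.6820952, 0.6950459, 0.6856508)
)
sapply(results[-1], function(m) results$Model[which.min(m)])
# ME/MPE/MSE -> freight_0_1_0; MAE/MAPE -> freight_3_1_2
```

This confirms that no single model dominates on every error measure.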
Following is the code to check for an ARCH effect in the data, using the series with the COVID period excluded.
freight_diff = diff(log(train_transportation))*100; plot(freight_diff, type = "l"); abline(h=0)
par(mar=c(4.5,5,1.5,1), mfcol=c(2,3))
acf(freight_diff, main="return", xlab="Lag", lag=50)
pacf(freight_diff, main="", xlab="Lag", lag=50)
acf(abs(freight_diff), main="abs return", xlab="Lag", lag=50)
pacf(abs(freight_diff), main="", xlab="Lag", lag=50)
acf(freight_diff^2, main="sqr return", xlab="Lag", lag=50)
pacf(freight_diff^2, main="", xlab="Lag", lag=50)
library(TSA)
## Registered S3 methods overwritten by 'TSA':
## method from
## fitted.Arima forecast
## plot.Arima forecast
##
## Attaching package: 'TSA'
## The following object is masked from 'package:readr':
##
## spec
## The following objects are masked from 'package:stats':
##
## acf, arima
## The following object is masked from 'package:utils':
##
## tar
McLeod.Li.test(y=freight_diff)
skewness(freight_diff)
## [1] -1.520745
kurtosis(freight_diff)
## [1] 6.456526
qqnorm(freight_diff); qqline(freight_diff)
shapiro.test(freight_diff)
##
## Shapiro-Wilk normality test
##
## data: freight_diff
## W = 0.91464, p-value = 1.067e-08
From the squared-return ACF/PACF plots and the McLeod-Li test we can conclude that there is no ARCH effect in the data. However, the series is not normal based on the Q-Q plot and the Shapiro-Wilk test; the negative skewness (-1.52) and high kurtosis (6.46) likewise point to a heavy left tail.
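As a complementary check to McLeod.Li.test(), a Ljung-Box test can be applied directly to the squared returns; a large p-value there is consistent with the no-ARCH conclusion. A minimal sketch, assuming freight_diff is the differenced log series computed above (lag = 12 is an illustrative choice):

```r
# Sketch: Ljung-Box test on the squared returns as an ARCH check.
# Assumption: freight_diff is the differenced log series from above;
# lag = 12 is illustrative. An insignificant p-value suggests no ARCH effect.
Box.test(freight_diff^2, lag = 12, type = "Ljung-Box")
```

Because ARCH effects show up as autocorrelation in the squared series, this one-line test targets the same hypothesis as the McLeod-Li plot.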